Collaborating Authors: Saskatoon


Selective Masking based Self-Supervised Learning for Image Semantic Segmentation

Wang, Yuemin, Stavness, Ian

arXiv.org Artificial Intelligence

This paper proposes a novel self-supervised learning method for semantic segmentation that uses selective masking image reconstruction as the pretraining task. The method replaces the random masking augmentation used in most masked image modelling pretraining methods: it breaks image reconstruction pretraining into iterative steps and, at each step, selectively masks the image patches with the highest reconstruction loss, leveraging the knowledge of the partially trained model. We show on two general datasets (Pascal VOC and Cityscapes) and two weed segmentation datasets (Nassar 2020 and Sugarbeets 2016) that selective masking outperforms both traditional random masking and supervised ImageNet pretraining on downstream segmentation accuracy, by 2.9% for the general datasets and 2.5% for the weed segmentation datasets. Furthermore, we found that selective masking significantly improves accuracy for the lowest-performing classes. Lastly, we show that using the same dataset for pretraining and the downstream task yields the best results for low-budget self-supervised pretraining. Our Selective Masking Image Reconstruction method provides an effective and practical way to improve end-to-end semantic segmentation workflows, especially in scenarios where limited model capacity is required to meet inference speed and computational resource constraints.
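As an illustration of the core idea, the sketch below is our reading of the abstract, not the authors' released code: the probe pass, `model`, and `selective_mask` are assumed names for an MAE-style autoencoder setup. It ranks patches by the current model's per-patch reconstruction error and masks the hardest ones for the next pretraining step.

```python
import torch

def selective_mask(model, patches, mask_ratio=0.75):
    """Mask the patches the current model reconstructs worst.

    Illustrative sketch, not the paper's implementation. `model` is assumed
    to be any autoencoder that reconstructs patch pixels in a probe pass.
    patches: (B, N, D) flattened image patches.
    Returns a boolean mask of shape (B, N); True marks a masked patch.
    """
    with torch.no_grad():
        recon = model(patches)                                   # (B, N, D)
        per_patch_loss = ((recon - patches) ** 2).mean(dim=-1)   # (B, N)
    n_masked = int(mask_ratio * patches.shape[1])
    # Indices of the highest-loss (hardest) patches in each image.
    hard = per_patch_loss.topk(n_masked, dim=1).indices
    mask = torch.zeros(patches.shape[:2], dtype=torch.bool, device=patches.device)
    mask.scatter_(1, hard, True)
    return mask
```

Ranking by reconstruction loss concentrates the pretraining signal on regions the model has not yet learned, which is what distinguishes this scheme from uniform random masking.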


Full-Stack Alignment: Co-Aligning AI and Institutions with Thick Models of Value

Edelman, Joe, Zhi-Xuan, Tan, Lowe, Ryan, Klingefjord, Oliver, Wang-Mascianica, Vincent, Franklin, Matija, Kearns, Ryan Othniel, Hain, Ellie, Sarkar, Atrisha, Bakker, Michiel, Barez, Fazl, Duvenaud, David, Foerster, Jakob, Gabriel, Iason, Gubbels, Joseph, Goodman, Bryce, Haupt, Andreas, Heitzig, Jobst, Jara-Ettinger, Julian, Kasirzadeh, Atoosa, Kirkpatrick, James Ravi, Koh, Andrew, Knox, W. Bradley, Koralus, Philipp, Lehman, Joel, Levine, Sydney, Marro, Samuele, Revel, Manon, Shorin, Toby, Sutherland, Morgan, Tessler, Michael Henry, Vendrov, Ivan, Wilken-Smith, James

arXiv.org Artificial Intelligence

Beneficial societal outcomes cannot be guaranteed by aligning individual AI systems with the intentions of their operators or users. Even an AI system that is perfectly aligned to the intentions of its operating organization can lead to bad outcomes if the goals of that organization are misaligned with those of other institutions and individuals. For this reason, we need full-stack alignment: the concurrent alignment of AI systems and the institutions that shape them with what people value. This can be done without imposing a particular vision of individual or collective flourishing. We argue that current approaches for representing values, such as utility functions, preference orderings, or unstructured text, are ill-suited to this task: they struggle to distinguish values from other signals, to support principled normative reasoning, and to model collective goods. We propose that thick models of value will be needed. These structure the way values and norms are represented, enabling systems to distinguish enduring values from fleeting preferences, to model the social embedding of individual choices, and to reason normatively, applying values in new domains. We demonstrate this approach in five areas: AI value stewardship, normatively competent agents, win-win negotiation systems, meaning-preserving economic mechanisms, and democratic regulatory institutions.


HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation

Raza, Shaina, Narayanan, Aravind, Khazaie, Vahid Reza, Vayani, Ashmal, Radwan, Ahmed Y., Chettiar, Mukund S., Singh, Amandeep, Shah, Mubarak, Pandya, Deval

arXiv.org Artificial Intelligence

Although recent large multimodal models (LMMs) demonstrate impressive progress on vision-language tasks, their alignment with human-centered (HC) principles, such as fairness, ethics, inclusivity, empathy, and robustness, remains poorly understood. We present HumaniBench, a unified evaluation framework designed to characterize HC alignment across realistic, socially grounded visual contexts. HumaniBench contains 32,000 expert-verified image-question pairs derived from real-world news imagery and spanning seven evaluation tasks: scene understanding, instance identity, multiple-choice visual question answering (VQA), multilinguality, visual grounding, empathetic captioning, and image resilience testing. Each task is mapped to one or more HC principles through a principled operationalization of metrics covering accuracy, harmful content detection, hallucination and faithfulness, coherence, cross-lingual quality, empathy, and robustness. We evaluate 15 state-of-the-art LMMs under this framework and observe consistent cross-model trade-offs: proprietary systems achieve the strongest performance on ethics, reasoning, and empathy, while open-source models exhibit superior visual grounding and resilience. All models, however, show persistent gaps in fairness and multilingual inclusivity. We further analyze the effect of inference-time techniques, finding that chain-of-thought prompting and test-time scaling yield 8-12% improvements on several HC dimensions. HumaniBench provides a reproducible, extensible foundation for systematic HC evaluation of LMMs and enables fine-grained analysis of alignment trade-offs that are not captured by conventional multimodal benchmarks. https://vectorinstitute.github.io/humanibench/
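As a sketch of how such a framework can roll task-level metric scores up into principle-level scores, consider the following; the task-to-principle assignments shown are illustrative guesses, not HumaniBench's actual mapping.

```python
from collections import defaultdict

# Hypothetical task-to-principle map -- the paper defines its own mapping;
# these assignments are illustrative, not HumaniBench's.
TASK_TO_PRINCIPLES = {
    "scene_understanding": ["fairness", "robustness"],
    "empathetic_captioning": ["empathy"],
    "multilinguality": ["inclusivity"],
}

def principle_scores(task_scores):
    """Average per-task metric scores into per-principle alignment scores."""
    totals, counts = defaultdict(float), defaultdict(int)
    for task, score in task_scores.items():
        for principle in TASK_TO_PRINCIPLES.get(task, []):
            totals[principle] += score
            counts[principle] += 1
    return {p: totals[p] / counts[p] for p in totals}

print(principle_scores({"scene_understanding": 0.71, "multilinguality": 0.55}))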


TABLET: A Large-Scale Dataset for Robust Visual Table Understanding

Alonso, Iñigo, Miranda, Imanol, Agirre, Eneko, Lapata, Mirella

arXiv.org Artificial Intelligence

While table understanding increasingly relies on pixel-only settings where tables are processed as visual representations, current benchmarks predominantly use synthetic renderings that lack the complexity and visual diversity of real-world tables. Additionally, existing visual table understanding (VTU) datasets offer fixed examples with single visualizations and pre-defined instructions, providing no access to underlying serialized data for reformulation. TABLET addresses these gaps: each example includes paired image-HTML representations, comprehensive metadata, and provenance information linking back to the source datasets. The field of table understanding focuses on techniques for representing and interpreting tabular data to support a wide range of practical tasks such as question answering, summarization, and information extraction. Research in this area has traditionally represented tables as structured text, encoding their content and layout through linearized or graph-based representations (see Figure 1b; Herzig et al. 2020; Zhang et al. 2020; Liu et al. 2022). While this unimodal view remains effective in certain domains, many tables found in documents and webpages contain irregular structures, rely on visual formatting (e.g., merged cells, background colors, font variations), or embed multimodal elements such as images (see Figure 1a). Advances in Vision-Language Models (VLMs; Radford et al. 2021; Liu et al. 2023) have provided impetus for treating tables as images, eschewing the step of rendering them as text sequences (like Markdown or HTML). The conceptual simplicity of this approach, coupled with improved performance on several tabular tasks (Alonso et al., 2024; Zhou et al., 2025), has driven significant research interest (Zheng et al., 2024b; Su et al., 2024; Jiang et al., 2025) in Visual Table Understanding (also known as Multimodal Table Understanding). Visual representations of tables are not merely convenient but in many cases necessary, particularly for VLM agents that interact with the world exclusively through pixels (e.g., on a screen) and must interpret tables directly in their visual form (Deng et al., 2023; Zheng et al., 2024a; Lu et al., 2024). Despite the growing relevance of VTU, there are few resources that support training models directly on image-based representations of tables. Existing benchmarks like MMTab (Zheng et al., 2024b) consist of web tables (e.g., from Wikipedia, a common source for many tabular datasets) that are serialized and subsequently rendered as synthetic images (see Figures 1b,c). As a result, models trained on such data face a train-test mismatch: the visual patterns learned from serialized renderings fail to capture critical visual cues such as subtle ruling lines, intricate merged-cell layouts, background colors, font variations, or embedded images that are inherent to real-world table comprehension, and so do not generalize well to naturally occurring tables (compare Figures 1a and 1c).
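To make the contrast concrete, the sketch below serializes one toy table as Markdown and as HTML, the text-sequence representations the paragraph mentions; a pixel-only VTU model would instead consume an image rendered from such HTML. All names and data here are illustrative.

```python
# Toy table and serializers; illustrative only, not TABLET's tooling.
rows = [["Model", "Accuracy"], ["A", "0.81"], ["B", "0.86"]]

def to_markdown(rows):
    header, *body = rows
    lines = ["| " + " | ".join(header) + " |",
             "|" + " --- |" * len(header)]
    lines += ["| " + " | ".join(r) + " |" for r in body]
    return "\n".join(lines)

def to_html(rows):
    cell = lambda r, tag: "".join(f"<{tag}>{c}</{tag}>" for c in r)
    header, *body = rows
    return ("<table><tr>" + cell(header, "th") + "</tr>" +
            "".join("<tr>" + cell(r, "td") + "</tr>" for r in body) +
            "</table>")

print(to_markdown(rows))
print(to_html(rows))
```

Note what the serializations necessarily discard: ruling lines, background colors, merged cells, and fonts all vanish, which is exactly the information a model trained on rendered serializations never learns to read.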


ReXGroundingCT: A 3D Chest CT Dataset for Segmentation of Findings from Free-Text Reports

Baharoon, Mohammed, Luo, Luyang, Moritz, Michael, Kumar, Abhinav, Kim, Sung Eun, Zhang, Xiaoman, Zhu, Miao, Alabbad, Mahmoud Hussain, Alhazmi, Maha Sbayel, Mistry, Neel P., Bijnens, Lucas, Kleinschmidt, Kent Ryan, Chrisler, Brady, Suryadevara, Sathvik, Jaliparthi, Sri Sai Dinesh, Prudlo, Noah Michael, Marino, Mark David, Palacio, Jeremy, Akula, Rithvik, Zhou, Di, Zhou, Hong-Yu, Hamamci, Ibrahim Ethem, Adams, Scott J., AlOmaish, Hassan Rayhan, Rajpurkar, Pranav

arXiv.org Artificial Intelligence

We introduce ReXGroundingCT, the first publicly available dataset linking free-text findings to pixel-level 3D segmentations in chest CT scans. The dataset includes 3,142 non-contrast chest CT scans paired with standardized radiology reports from CT-RATE. Construction followed a structured three-stage pipeline. First, GPT-4 was used to extract and standardize findings, descriptors, and metadata from reports originally written in Turkish and machine-translated into English. Second, GPT-4o-mini categorized each finding into a hierarchical ontology of lung and pleural abnormalities. Third, 3D annotations were produced for all CT volumes: the training set was quality-assured by board-certified radiologists, and the validation and test sets were fully annotated by board-certified radiologists. Additionally, a complementary chain-of-thought dataset was created to provide step-by-step hierarchical anatomical reasoning for localizing findings within the CT volume, using GPT-4o and localization coordinates derived from organ segmentation models. ReXGroundingCT contains 16,301 annotated entities across 8,028 text-to-3D-segmentation pairs, covering diverse radiological patterns across the 3,142 scans. About 79% of findings are focal abnormalities and 21% are non-focal. The dataset includes a public validation set of 50 cases and a private test set of 100 cases, both annotated by board-certified radiologists. The dataset establishes a foundation for free-text finding segmentation and grounded radiology report generation in CT imaging. Model performance on the private test set is hosted on a public leaderboard at https://rexrank.ai/ReXGroundingCT. The dataset is available at https://huggingface.co/datasets/rajpurkarlab/ReXGroundingCT.
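To make the structure of a text-to-3D-segmentation pair concrete, one record could be represented as follows; the field names and layout are a hypothetical schema of ours, not the dataset's actual on-disk format.

```python
from dataclasses import dataclass

@dataclass
class GroundedFinding:
    """Hypothetical record schema for one text-to-3D-segmentation pair."""
    scan_id: str            # CT-RATE volume identifier
    finding_text: str       # free-text finding extracted from the report
    ontology_path: tuple    # e.g. ("lung", "focal", ...) -- illustrative labels
    mask_path: str          # path to the 3D segmentation volume for this finding
    is_focal: bool          # per the abstract, ~79% of findings are focal
```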


CEPerFed: Communication-Efficient Personalized Federated Learning for Multi-Pulse MRI Classification

Li, Ludi, Mao, Junbin, Lin, Hanhe, Tian, Xu, Wu, Fang-Xiang, Liu, Jin

arXiv.org Artificial Intelligence

Multi-pulse magnetic resonance imaging (MRI) is widely used in clinical practice, for example in Alzheimer's disease diagnosis. Training a robust model for multi-pulse MRI classification requires large and diverse data from various medical institutions, while privacy must be protected by preventing raw data sharing across institutions. Although federated learning (FL) is a feasible solution to this problem, it poses two challenges: model convergence suffers under data heterogeneity, and communication overhead is substantial due to the large number of parameters transmitted by the model. To address these challenges, we propose CEPerFed, a communication-efficient personalized FL method. It mitigates the effect of data heterogeneity by incorporating client-side historical risk gradients and historical mean gradients to coordinate local and global optimization. The former are used to weight the contributions from other clients, enhancing the reliability of local updates, while the latter enforce consistency between local updates and the global optimization direction to ensure stable convergence across heterogeneous data distributions. To address the high communication overhead, we propose a hierarchical SVD (HSVD) strategy that transmits only the most critical information required for model updates. Experiments on five classification tasks demonstrate the effectiveness of CEPerFed. The code will be released upon acceptance at https://github.com/LD0416/CEPerFed. Index Terms: Personalized Federated Learning, HSVD, Dynamic Rank Selection, Multi-Pulse MRI.
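The abstract does not spell out HSVD, but the communication saving behind any SVD-based scheme is easy to make concrete. The sketch below is our illustration with assumed names, not the paper's hierarchical algorithm: a client transmits only a rank-r factorization of its weight update, shrinking the payload from m*n values to r*(m+n+1).

```python
import numpy as np

def compress_update(delta_w, rank):
    """Keep only the top-`rank` singular triplets of a weight update."""
    u, s, vt = np.linalg.svd(delta_w, full_matrices=False)
    return u[:, :rank], s[:rank], vt[:rank, :]

def decompress_update(u, s, vt):
    """Low-rank reconstruction of the update on the server side."""
    return (u * s) @ vt

delta = np.random.randn(256, 128)          # a client's local update (toy)
u, s, vt = compress_update(delta, rank=8)  # 8*(256+128+1) values sent
approx = decompress_update(u, s, vt)       # vs. 256*128 for the full matrix
```

The "Dynamic Rank Selection" index term suggests the paper chooses the rank adaptively rather than fixing it as above; how it does so is not described in the abstract.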


We Politely Insist: Your LLM Must Learn the Persian Art of Taarof

Sadr, Nikta Gohari, Heidariasl, Sahar, Megerdoomian, Karine, Seyyed-Kalantari, Laleh, Emami, Ali

arXiv.org Artificial Intelligence

Large language models (LLMs) struggle to navigate culturally specific communication norms, limiting their effectiveness in global contexts. We focus on Persian taarof, a social norm in Iranian interactions: a sophisticated system of ritual politeness that emphasizes deference, modesty, and indirectness, yet remains absent from existing cultural benchmarks. We introduce TaarofBench, the first benchmark for evaluating LLM understanding of taarof, comprising 450 role-play scenarios covering 12 common social interaction topics, validated by native speakers. Our evaluation of five frontier LLMs reveals substantial gaps in cultural competence, with accuracy rates 40-48% below native speakers when taarof is culturally appropriate. Performance varies across interaction topics, improves with Persian-language prompts, and exhibits gender-based asymmetries. We also show that responses rated "polite" by standard metrics often violate taarof norms, indicating the limitations of Western politeness frameworks. Through supervised fine-tuning and Direct Preference Optimization (DPO), we achieve 21.8% and 42.3% improvements, respectively, in model alignment with cultural expectations. Our human study with 33 participants (11 native Persian, 11 heritage, and 11 non-Iranian speakers) establishes baselines across varying degrees of familiarity with Persian norms. This work lays the foundation for developing diverse and culturally aware LLMs, enabling applications that better navigate complex social interactions.
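For reference, the standard DPO objective (Rafailov et al., 2023) used in such a fine-tuning stage trains the policy $\pi_\theta$ to prefer a chosen response $y_w$ over a rejected one $y_l$ relative to a frozen reference model $\pi_{\mathrm{ref}}$; pairing a taarof-conforming response as $y_w$ with a norm-violating one as $y_l$ is our reading of the setup, not a detail stated in the abstract:

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]$$

where $\sigma$ is the logistic function and $\beta$ controls how far the tuned policy may drift from the reference model.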


Voice-guided Orchestrated Intelligence for Clinical Evaluation (VOICE): A Voice AI Agent System for Prehospital Stroke Assessment

Acosta, Julian, Adams, Scott, Kernbach, Julius, Hardy, Romain, Kim, Sung Eun, Luo, Luyang, Zhang, Xiaoman, Johri, Shreya, Baharoon, Mohammed, Rajpurkar, Pranav

arXiv.org Artificial Intelligence

We developed a voice-driven artificial intelligence (AI) system that guides anyone - from paramedics to family members - through expert-level stroke evaluations using natural conversation, while also enabling smartphone video capture of key examination components for documentation and potential expert review. This addresses a critical gap in emergency care: current stroke recognition by first responders is inconsistent and often inaccurate, with sensitivity for stroke detection as low as 58%, causing life-threatening delays in treatment. Three non-medical volunteers used our AI system to assess ten simulated stroke patients, including cases with likely large vessel occlusion (LVO) strokes and stroke-like conditions, while we measured diagnostic accuracy, completion times, user confidence, and expert physician review of the AI-generated reports. The AI system correctly identified 84% of individual stroke signs and detected 75% of likely LVOs, completing evaluations in just over 6 minutes. Users reported high confidence (median 4.5/5) and ease of use (mean 4.67/5). The system successfully identified 86% of actual strokes but also incorrectly flagged 2 of 3 non-stroke cases as strokes. When an expert physician reviewed the AI reports with videos, they identified the correct diagnosis in 100% of cases, but felt confident enough to make preliminary treatment decisions in only 40% of cases due to observed AI errors including incorrect scoring and false information. While the current system's limitations necessitate human oversight, ongoing rapid advancements in speech-to-speech AI models suggest that future versions are poised to enable highly accurate assessments. Achieving human-level voice interaction could transform emergency medical care, putting expert-informed assessment capabilities in everyone's hands.


Reconstructing Biological Pathways by Applying Selective Incremental Learning to (Very) Small Language Models

Saha, Pranta, Reimer, Joyce, Byrns, Brook, Burbridge, Connor, Dhar, Neeraj, Chen, Jeffrey, Rayan, Steven, Broderick, Gordon

arXiv.org Artificial Intelligence

The use of generative artificial intelligence (AI) models is becoming ubiquitous in many fields. Though progress continues to be made, general-purpose large language models (LLMs) show a tendency to deliver creative answers, often called "hallucinations", which has slowed their application in the medical and biomedical fields, where accuracy is paramount. We propose that the design and use of much smaller, domain- and even task-specific language models (LMs) may be a more rational and appropriate use of this technology in biomedical research. In this work, we apply a very small LM by today's standards to the specialized task of predicting regulatory interactions between molecular components, to fill gaps in our current understanding of intracellular pathways. Toward this goal, we attempt to correctly posit known pathway-informed interactions recovered from manually curated pathway databases by selecting and using only the most informative examples as part of an active learning scheme. With this example we show that a small (~110 million parameters) LM based on the Bidirectional Encoder Representations from Transformers (BERT) architecture can propose molecular interactions relevant to tuberculosis persistence and transmission with over 80% accuracy, using less than 25% of the ~520 regulatory relationships in question. Using information entropy as a metric for the iterative selection of new tuning examples, we also find that increased accuracy is driven by favoring the use of incorrectly assigned statements with the highest certainty (lowest entropy). In contrast, the concurrent use of correct but least certain examples contributed little and may even have been detrimental to the learning rate.
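The selection rule the last two sentences describe is simple to state in code. The sketch below is our illustration, with assumed array shapes and names, not the authors' implementation: it picks the model's most confident mistakes, i.e. misclassified examples with the lowest predictive entropy.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of each row of class probabilities."""
    p = np.clip(p, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=-1)

def select_for_tuning(probs, preds, labels, k):
    """probs: (N, C) predicted probabilities; preds/labels: (N,) class ids.

    Returns indices of up to k misclassified examples, ordered by ascending
    entropy -- the most *certain* mistakes first, per the abstract's finding.
    """
    h = entropy(probs)
    wrong_idx = np.flatnonzero(preds != labels)
    order = wrong_idx[np.argsort(h[wrong_idx])]
    return order[:k]
```

Favoring low-entropy errors targets the statements the model is confidently wrong about, which the abstract reports drives the accuracy gains, whereas adding correct-but-uncertain examples contributed little.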


From Semantic To Instance: A Semi-Self-Supervised Learning Approach

Najafian, Keyhan, Maleki, Farhad, Jin, Lingling, Stavness, Ian

arXiv.org Artificial Intelligence

Instance segmentation is essential for applications such as automated monitoring of plant health, growth, and yield. However, developing instance segmentation models requires large-scale datasets with pixel-level annotations of each object instance, and the extensive effort needed to create them restricts the use of deep learning in these areas. This challenge is even greater for images with densely packed, self-occluded objects, which are common in agriculture. To address it, we propose a semi-self-supervised learning approach that requires minimal manual annotation to develop a high-performing instance segmentation model. We design GLMask, an image-mask representation that encourages the model to focus on shape, texture, and pattern while minimizing its dependence on color features. We develop a pipeline that generates semantic segmentations and then transforms them into instance-level segmentations. The proposed approach substantially outperforms conventional instance segmentation models, establishing a state-of-the-art wheat head instance segmentation model with mAP@50 of 98.5%. Additionally, we assessed the proposed methodology on the general-purpose Microsoft COCO dataset, achieving a significant performance improvement of over 12.6% mAP@50. This highlights that the utility of our approach extends beyond precision agriculture and applies to other domains, specifically those with similar data characteristics.
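For concreteness, the simplest semantic-to-instance transform is connected-component labelling of a binary class mask. The toy sketch below illustrates that step only; it is not GLMask or the paper's pipeline, which must also separate densely packed, self-occluded objects that touch each other.

```python
import numpy as np
from scipy import ndimage

# Toy binary semantic mask: 1 = foreground class (e.g., "wheat head").
semantic_mask = np.array([[0, 1, 1, 0, 0],
                          [0, 1, 0, 0, 1],
                          [0, 0, 0, 1, 1]], dtype=bool)

# Label each connected foreground region with a distinct integer id.
instance_map, n_instances = ndimage.label(semantic_mask)
print(n_instances)   # 2 separate objects in this toy mask
print(instance_map)  # per-pixel instance ids (0 = background)
```

Connected components fail precisely where this paper operates, on touching or overlapping instances, which is why a learned semantic-to-instance transform is needed rather than this purely geometric one.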